Overview

This analysis performs differential gene expression analysis on RNA-seq data from maize leaf samples across multiple experimental factors:

  • Leaf tissue position: Continuous gradient from apical to basal (leaf 1–4)
  • Phosphorus treatment: High P (+P) vs Low P (-P)
  • Genotype: Control (CTRL) vs Inversion 4m (Inv4m)

The analysis uses the limma-voom pipeline to identify genes that respond to:

  • Inv4m: Genotype effect (Inversion 4m vs Control)
  • Leaf: Leaf stage effect
  • -P: Phosphorus deficiency effect
  • Leaf:-P: Interaction between leaf position and P treatment

Linear Model Specification for Gene Expression Analysis

This model describes the log-transformed expression \(y_{g,i}\) of gene \(g\) in sample \(i\) as a function of spatial, developmental, and treatment effects. Each gene is fitted separately using a common design matrix, producing gene-specific regression coefficients.

\[ \begin{aligned} y_{g,i} &= \beta_{g,0} + \beta_{g,R}\text{Row}_i + \beta_{g,L}\text{leafc}_i + \beta_{g,T}T_i \\ &\quad + \beta_{g,L\times T}(\text{leafc}_i \times T_i) + \beta_{g,G}G_i \\ &\quad + \beta_{g,G\times T}(G_i \times T_i) + \beta_{g,G\times L}(G_i \times \text{leafc}_i) \\ &\quad + u_{g,\text{col}(i):T(i)} + \varepsilon_{g,i} \end{aligned} \]

with,

\[ u_{g,\text{col}(i):T(i)} \sim N(0,\sigma^2_{u,g}), \quad \varepsilon_{g,i} \sim N(0,\phi_i \sigma^2_g) \]

Explanation of Model Components

Symbol Description Varies Over Notes
\(y_{g,i}\) \(\log_2\)(CPM) expression of gene \(g\) in sample \(i\) genes, samples Response variable
\(\beta_{g,0}\) Intercept (mean expression at centered leaf stage, +P treatment) genes Baseline expression
\(\beta_{g,R}\) Fixed effect of plot row genes Corrects for field spatial gradients
\(\beta_{g,L}\) Slope for leaf developmental stage (centered) genes Represents linear change with leaf age
\(\beta_{g,T}\) Effect of phosphorus treatment (−P vs. +P) genes Treatment main effect
\(\beta_{g,L\times T}\) Interaction between leaf stage and treatment genes Tests whether developmental slope differs under P deficiency
\(\beta_{g,G}\) Effect of inversion genotype genes Genotypic main effect
\(\beta_{g,G\times T}\) Interaction between inversion and treatment genes Tests whether inversion modifies P response
\(\beta_{g,G\times L}\) Interaction between inversion and leaf genes Tests whether the effect of the leaf depends on Inv4m
\(u_{g,\text{col}(i):T(i)}\) Random effect of plot column nested within treatment samples (within treatment) Accounts for spatial autocorrelation unique to each treatment field
\(\varepsilon_{g,i}\) Residual error with voom-derived precision weights samples Captures heteroskedastic measurement error

Notes on Indexing and Interpretation

  • \(i\) indexes samples, not genes. Predictors like Row, leaf_c, T, and G vary at the sample level.
  • \(g\) indexes genes, with each gene having its own set of regression coefficients \(\beta_{g,*}\).
  • The term \(u_{g,\text{col}(i):T(i)}\) introduces a treatment-specific random spatial correction, avoiding confounding between column and treatment.
  • The residual term \(\varepsilon_{g,i}\) incorporates precision weights from the voom transformation, where samples with lower measurement variance receive higher weight.
  • This parametrization improves interpretability by centering leaf_c (so the intercept represents average expression) and nesting the spatial correction by treatment rather than imposing a global linear spatial trend.

Fold Change Thresholds

  • Leaf effect: ±0.5 log2FC
  • P × Leaf interaction: ±0.5 log2FC
  • Inv4m × P interaction: ±0.5 log2FC
  • Inv4m × Leaf interaction: ±0.5 log2FC
  • Other effects (-P, Inv4m): ±1.5 log2FC

Key Genomic Regions

  • Inv4m inversion: Chr4:172,883,675–188,132,113
  • Shared introgression: Chr4:157,012,149–195,900,523

Genes are classified as:

  • in_Inv4m: Within the inversion proper
  • in_cis: Within the shared introgression (includes inversion + flanking)
  • in_trans: Outside the shared introgression region

DEG Classification Tiers

Three levels of stringency for different analyses:

  • Significant DEGs: for Manhattan plots, FDR < 0.05 (is_DEG).
  • High-confidence DEGs: for GO enrichment, FDR < 0.05 AND \(\log_2FC > \pm1.5\) (is_hiconf_DEG).
  • Selected DEGs: for manuscript main tables, the union (\(\cup\)) of the top 20 by significance and the top 20 by Mahalanobis distance (is_selected_DEG).

Libraries

library(edgeR)
library(limma)
library(rtracklayer)
library(GenomicRanges)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(ggpubr)
library(ggfx)
library(ggtext)
library(robustbase)

Data Import

Load Expression Counts

counts <- read.csv(file.path(paths$data, "inv4mRNAseq_gene_sample_exp.csv"))
{
  cat("Loaded expression\n")
  cat("  Dimensions:", dim(counts), "\n")
  cat("  Genes:", nrow(counts), "\n")
  cat("  Samples:", ncol(counts) - 2, "\n")
}
## Loaded expression
##   Dimensions: 39756 62 
##   Genes: 39756 
##   Samples: 60

Load Sample Metadata

sampleInfo <- read.csv(file.path(paths$data, "PSU2022_metadata.csv"))
# Recode Treatment: High_P -> +P, Low_P -> -P
sampleInfo$Treatment <- factor(sampleInfo$Treatment,
                               levels = c("High_P", "Low_P"),
                               labels = c("+P", "-P"))
# Recode Genotype: CTRL -> CTRL, INV4m -> Inv4m
sampleInfo$Genotype <- factor(sampleInfo$Genotype,
                              levels = c("CTRL", "INV4m"),
                              labels = c("CTRL", "Inv4m"))
# RNA_Plant already exists in new metadata (actual plant field IDs)
{
  cat("\nSample meta\n")
  cat("  Total samples:", nrow(sampleInfo), "\n")
  cat("  Genotypes:", paste(unique(sampleInfo$Genotype), collapse = ", "), "\n")
  cat("  Treatments:", paste(unique(sampleInfo$Treatment), collapse = ", "), "\n")
}
## 
## Sample meta
##   Total samples: 64 
##   Genotypes: CTRL, Inv4m 
##   Treatments: -P, +P

Prepare Count Matrix

# Extract gene IDs
gene_ids <- data.frame(gene = counts[, 2])

# Convert to matrix and set row names
counts <- as.matrix(counts[, -c(1:2)])
rownames(counts) <- gene_ids$gene

# Use library column for sample matching (side_tag has Excel errors)
sampleNames <- colnames(counts)

# Reorder metadata to match count matrix column order
sampleInfo <- sampleInfo[match(sampleNames, sampleInfo$library), ]

{
  cat("\nAll samples in metadata:",
      all(sampleNames %in% sampleInfo$library), "\n")
  cat("Count matrix prepared:\n")
  cat("  Genes:", nrow(counts), "\n")
  cat("  Samples:", ncol(counts), "\n")
}
## 
## All samples in metadata: TRUE 
## Count matrix prepared:
##   Genes: 39756 
##   Samples: 60

Create DGEList Object

# Create DGEList with counts and sample information
y <- DGEList(counts = counts, samples = sampleInfo)

# Define groups from Treatment and Genotype interaction
y$group <- interaction(y$samples$Treatment, y$samples$Genotype)

cat("\nDGEList object created\n")
## 
## DGEList object created
head(y$samples)
##             group lib.size norm.factors     library top_tag rna_ctn rna_batch1
## L01_S1_L002     1   327979            1 L01_S1_L002     L01      49          1
## L02_S2_L002     1   798505            1 L02_S2_L002     L02     101          1
## L03_S3_L002     1  2398804            1 L03_S3_L002     L03      41          1
## L04_S4_L002     1  2476414            1 L04_S4_L002     L04      94          1
## L05_S5_L002     1 31396360            1 L05_S5_L002     L05      42          1
## L06_S6_L002     1 30520111            1 L06_S6_L002     L06      63          1
##             tube_n side NCSU_RNA_plant Treatment Genotype leaf_tissue leaf_node
## L01_S1_L002      1    L              1        -P     CTRL           1         1
## L02_S2_L002      2    L              1        -P     CTRL           2         3
## L03_S3_L002      3    L              1        -P     CTRL           3         5
## L04_S4_L002      4    L              1        -P     CTRL           4         7
## L05_S5_L002      5    L              2        -P    Inv4m           1         1
## L06_S6_L002      6    L              2        -P    Inv4m           2         3
##                side_tag       TIME decimal_time  row COLLECTOR NOTE  S   Plot
## L01_S1_L002 -P1_B73_L1L 12H 37M 0S     12.61667 3056    SERGIO       1 PlotVI
## L02_S2_L002 -P1_B73_L2L 12H 42M 0S     12.70000 3056    SERGIO       2 PlotVI
## L03_S3_L002 -P1_B73_L3L 12H 43M 0S     12.71667 3056    SERGIO       3 PlotVI
## L04_S4_L002 -P1_B73_L4L 12H 44M 0S     12.73333 3056    SERGIO       4 PlotVI
## L05_S5_L002 -P1MICH_L1L 12H 46M 0S     12.76667 3059    SERGIO      NA PlotVI
## L06_S6_L002 -P1MICH_L2L 12H 48M 0S     12.80000 3059    SERGIO      NA PlotVI
##             Rep Plot_Row Plot_Column std_count_06102022 RNA_Q     borde notes
## L01_S1_L002   6        3           3                  4     3 variacion    NA
## L02_S2_L002   6        3           3                  4     3 variacion    NA
## L03_S3_L002   6        3           3                  4     3 variacion    NA
## L04_S4_L002   6        3           3                  4     3 variacion    NA
## L05_S5_L002   6        4           3                  4     5              NA
## L06_S6_L002   6        4           3                  4     5              NA
##             leaves_per_plant RNA_Plant
## L01_S1_L002                7        26
## L02_S2_L002                7        26
## L03_S3_L002                7        26
## L04_S4_L002                7        26
## L05_S5_L002                8        16
## L06_S6_L002                8        16

Expression Filtering and Sample Quality Control

Filter Genes by Expression Level

Using filterByExpr to remove genes with low counts across samples.

# Keep genes with sufficient expression
keep <- filterByExpr(y, group = y$group)
y_filtered <- y[keep, ]

{
  cat("\nExpression filtering:\n")
  cat("  Genes before:", nrow(y), "\n")
  cat("  Genes after:", nrow(y_filtered), "\n")
  cat("  Genes removed:", sum(!keep), "\n")
}
## 
## Expression filtering:
##   Genes before: 39756 
##   Genes after: 24249 
##   Genes removed: 15507

Initial MDS: Quality Control Check

Compute MDS on all libraries (after gene filtering but before sample filtering) to identify low-quality samples based on library size.

# MDS with all samples (before library size filtering)
mds_all <- plotMDS(y_filtered, pch = 21, plot = TRUE)

# Histogram of library sizes
hist(
  y$samples$lib.size / 1e6,
  main = "Library Size Distribution",
  xlab = "Library Size (millions of reads)",
  breaks = 20
)

{
  cat("\nLibrary size summary:\n")
  print(summary(y$samples$lib.size))
  cat("\nSamples with lib.size < 20 million:",
      sum(y$samples$lib.size < 2e7), "\n")
}
## 
## Library size summary:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   327979 10716231 31446964 27205445 36129276 56805083 
## 
## Samples with lib.size < 20 million: 17
# Prepare data for plotting
mds_qc_data <- y_filtered$samples %>%
  mutate(
    dim1 = mds_all$x,
    dim2 = mds_all$y,
    lib_size_millions = lib.size / 1e6
  )

# Plot MDS colored by library size
ggplot(mds_qc_data, aes(x = dim1, y = dim2, color = lib_size_millions)) +
  geom_point(size = 3) +
  scale_color_viridis_c(option = "plasma") +
  labs(
    x = paste0("Dim1 (", round(100 * mds_all$var.explained[1]), "%)"),
    y = paste0("Dim2 (", round(100 * mds_all$var.explained[2]), "%)"),
    color = "Library Size\n(millions)"
  ) +
  theme_classic2(base_size = 15)

Filter Low Quality Libraries

Samples with library size < 20 million reads are considered low quality.

# Flag low count libraries
y_filtered$samples$lowCount <- y_filtered$samples$lib.size < 2e7

# Remove low quality samples
y_filtered_bySample <- y_filtered[, !y_filtered$samples$lowCount]

{
  cat("\nLow quality libraries:\n")
  print(table(y_filtered$samples$lowCount))
  cat("\nSamples after filtering:\n")
  cat("  Retained:", ncol(y_filtered_bySample), "\n")
  cat("  Removed:", sum(y_filtered$samples$lowCount), "\n")
}
## 
## Low quality libraries:
## 
## FALSE  TRUE 
##    43    17 
## 
## Samples after filtering:
##   Retained: 43 
##   Removed: 17

Sample Distribution Diagnostics

Verify experimental design balance across factors after quality filtering.

with(y_filtered_bySample$samples, {
  cat("\n=== Sample Distribution by Factors ===\n")
  cat("\n-- Treatment --\n")
  print(table(Treatment))
  cat("\n-- Genotype --\n")
  print(table(Genotype))
  cat("\n-- Leaf Tissue --\n")
  print(table(leaf_tissue))
  cat("\n-- Treatment × Leaf Tissue --\n")
  print(table(Treatment, leaf_tissue))
  cat("\n-- Genotype × Leaf Tissue --\n")
  print(table(Genotype, leaf_tissue))
  cat("\n-- Treatment × Genotype × Leaf Tissue (3-way) --\n")
  print(table(Treatment, Genotype, leaf_tissue))
})
## 
## === Sample Distribution by Factors ===
## 
## -- Treatment --
## Treatment
## +P -P 
## 24 19 
## 
## -- Genotype --
## Genotype
##  CTRL Inv4m 
##    20    23 
## 
## -- Leaf Tissue --
## leaf_tissue
##  1  2  3  4 
## 11 11 11 10 
## 
## -- Treatment × Leaf Tissue --
##          leaf_tissue
## Treatment 1 2 3 4
##        +P 6 6 6 6
##        -P 5 5 5 4
## 
## -- Genotype × Leaf Tissue --
##         leaf_tissue
## Genotype 1 2 3 4
##    CTRL  4 6 5 5
##    Inv4m 7 5 6 5
## 
## -- Treatment × Genotype × Leaf Tissue (3-way) --
## , , leaf_tissue = 1
## 
##          Genotype
## Treatment CTRL Inv4m
##        +P    3     3
##        -P    1     4
## 
## , , leaf_tissue = 2
## 
##          Genotype
## Treatment CTRL Inv4m
##        +P    3     3
##        -P    3     2
## 
## , , leaf_tissue = 3
## 
##          Genotype
## Treatment CTRL Inv4m
##        +P    3     3
##        -P    2     3
## 
## , , leaf_tissue = 4
## 
##          Genotype
## Treatment CTRL Inv4m
##        +P    3     3
##        -P    2     2

Quality Control: Batch Effects

Compute MDS Dimensions

MDS reveals sources of variation in gene expression across samples. Dimensions 3 and 4 are extracted from eigenvectors scaled by square root of variance explained.

mds <- plotMDS(

  y_filtered_bySample,
  pch = 21,
  label = y_filtered_bySample$samples$library,
  plot = FALSE
)

# Store MDS coordinates in sample data
d <- y_filtered_bySample$samples
d$dim1 <- mds$x
d$dim2 <- mds$y
d$dim3 <- mds$eigen.vectors[, 3] * sqrt(mds$var.explained[3])
d$dim4 <- mds$eigen.vectors[, 4] * sqrt(mds$var.explained[4])

# Prepare factors for plotting
d$Treatment <- factor(d$Treatment)
d$Genotype <- factor(d$Genotype)
d$RNA_Plant <- factor(d$RNA_Plant)

# Variance explained by each dimension
tibble(
  dimension = paste("Dim", 1:4),
  var_explained = round(mds$var.explained[1:4], 4)
)
## # A tibble: 4 × 2
##   dimension var_explained
##   <chr>             <dbl>
## 1 Dim 1            0.265 
## 2 Dim 2            0.125 
## 3 Dim 3            0.0842
## 4 Dim 4            0.0724

MDS: Dimensions 1 and 2 Colored by Multiple Factors

Exploring which experimental factors drive the main dimensions of variation.

# Define custom green to orange palette for leaf positions
green_to_orange <- c(
  "#00954A",
  "#A4DF56",
  "#D7E23C",
  "#FF9A1F"
)

# Treatment
p1 <- ggplot(d, aes(x = dim1, y = dim2, color = Treatment)) +
  geom_point(size = 3) +
  theme_classic(base_size = 15) +
  theme(legend.position = "top")

# Row position
p2 <- ggplot(d, aes(x = dim1, y = dim2, color = row)) +
  geom_point(size = 3) +
  guides(
    color = guide_colourbar(
      title = "Row Position",
      direction = "horizontal",
      barwidth = 15,
      barheight = 1,
      title.position = "top",
      title.hjust = 0.5
    ),
    shape = "none"
  ) +
  theme_classic(base_size = 15) +
  theme(legend.position = "top")

# Collection time
p3 <- ggplot(d, aes(x = dim1, y = dim2, color = decimal_time)) +
  geom_point(size = 3) +
  guides(
    color = guide_colourbar(
      title = "Collection Time",
      direction = "horizontal",
      barwidth = 15,
      barheight = 1,
      title.position = "top",
      title.hjust = 0.5
    ),
    shape = "none"
  ) +
  theme_classic(base_size = 15) +
  theme(legend.position = "top")

# Collector
p4 <- ggplot(d, aes(x = dim1, y = dim2, color = COLLECTOR)) +
  geom_point(size = 3) +
  scale_color_discrete(labels = c("Team 1", "Team 2")) +
  theme_classic(base_size = 15) +
  theme(legend.position = "top") +
  labs(color = "Collector")

# Genotype
p5 <- ggplot(d, aes(x = dim1, y = dim2, color = Genotype)) +
  geom_point(size = 3) +
  theme_classic(base_size = 15) +
  theme(legend.position = "top") +
  labs(color = "Genotype")

# Leaf tissue
p6 <- d %>%
  mutate(leaf = factor(leaf_tissue)) %>%
  ggplot(aes(x = dim1, y = dim2, color = leaf)) +
  scale_color_manual(values = green_to_orange) +
  geom_point(size = 3) +
  theme_classic(base_size = 15) +
  theme(legend.position = "top") +
  labs(color = "Leaf")

ggarrange(p1, p2, p3, p4, p5, p6, ncol = 3, nrow = 2)

MDS: Primary Plot (Leaf Tissue and Treatment)

transcript_MDS_12 <- d %>%
  mutate(leaf = factor(leaf_tissue)) %>%
  ggplot(aes(x = dim1, y = dim2)) +
  ggtitle("MDS Transcripts") +
  xlab(paste0("dim1 (", round(100 * mds$var.explained[1]), "%)")) +
  ylab(paste0("dim2 (", round(100 * mds$var.explained[2]), "%)")) +
  geom_point(aes(fill = leaf, shape = Treatment), size = 4) +
  scale_fill_manual(values = green_to_orange) +
  scale_shape_manual(values = c(21, 25)) +
  guides(
    shape = guide_legend(
      title = element_blank(),
      order = 2,
      override.aes = list(fill = "black"),
      reverse = TRUE
    ),
    fill = guide_legend(
      title = "Leaf",
      order = 1,
      override.aes = list(geom = "point", shape = 22, size = 7)
    )
  ) +
  theme_classic2(base_size = 30) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    legend.spacing = unit(0, "line"),
    legend.box.spacing = unit(0, "in"),
    legend.background = element_rect(fill = "transparent")
  )

saveRDS(transcript_MDS_12, file.path(paths$intermediate, "transcript_MDS_12.rds"))
transcript_MDS_12

MDS: Dimensions 3 and 4

The third dimension separates by genotype.

# Set up genotype labels with italic formatting
labels <- c("CTRL", "*Inv4m*")
names(labels) <- c("CTRL", "INV4")

d %>%
  ggplot(aes(x = dim3, y = dim4, fill = Genotype, shape = Treatment)) +
  xlab(paste0("dim3 (", round(100 * mds$var.explained[3]), "%)")) +
  ylab(paste0("dim4 (", round(100 * mds$var.explained[4]), "%)")) +
  geom_point(size = 4) +
  scale_fill_viridis_d(direction = -1, labels = labels) +
  scale_shape_manual(values = c(24, 21)) +
  guides(
    shape = "none",
    fill = guide_legend(
      title = "Genotype",
      order = 2,
      override.aes = list(geom = "point", shape = 22, size = 7, reverse = TRUE)
    )
  ) +
  theme_classic2(base_size = 25) +
  theme(
    legend.position = c(0.89, 0.9),
    legend.text = element_markdown(),
    legend.spacing = unit(0, "line"),
    legend.box.spacing = unit(0, "line")
  )

MDS Dimension Correlations

# Calculate correlations and p-values
mds_cor_results <- tibble(
  dimension = c("Dim1", "Dim2", "Dim3"),
  factor = c("Treatment", "Leaf tissue", "Genotype"),
  correlation = c(
    cor(d$dim1, as.numeric(d$Treatment)),
    cor(d$dim2, d$leaf_tissue),
    cor(d$dim3, as.numeric(d$Genotype))
  ),
  p_value = c(
    cor.test(d$dim1, as.numeric(d$Treatment))$p.value,
    cor.test(d$dim2, d$leaf_tissue)$p.value,
    cor.test(d$dim3, as.numeric(d$Genotype))$p.value
  )
) %>%
  mutate(
    adj_p_value = p.adjust(p_value, method = "fdr"),
    correlation = round(correlation, 3),
    p_value = signif(p_value, 3),
    adj_p_value = signif(adj_p_value, 3)
  )

mds_cor_results
## # A tibble: 3 × 5
##   dimension factor      correlation  p_value adj_p_value
##   <chr>     <chr>             <dbl>    <dbl>       <dbl>
## 1 Dim1      Treatment         0.501 6.15e- 4    6.15e- 4
## 2 Dim2      Leaf tissue      -0.615 1.13e- 5    1.69e- 5
## 3 Dim3      Genotype          0.916 6.88e-18    2.06e-17

Normalization and Design Matrix

Apply TMM (Trimmed Mean of M-values) normalization to account for sequencing depth differences, then fit a linear model including spatial covariates (Plot_Row, Plot_Column), biological factors (Genotype, leaf_tissue, Treatment), and the leaf_tissue:Treatment interaction. Voom transformation estimates mean-variance relationships and computes precision weights for each observation.

# Center the continuous leaf stage variable
# Centering subtracts the mean leaf age so that:
#   - the intercept represents expression at the *average leaf age*,
#   - the Treatment coefficient corresponds to the treatment effect at mean leaf stage,
#   - and collinearity between leaf age and Treatment is minimized.
d$leaf_tissue_c <- scale(d$leaf_tissue, center = TRUE, scale = FALSE)

# Define block: column nested within treatment
block <- interaction(d$Treatment, d$Plot_Column)

# Design matrix: spatial covariates + biological factors + interactions
design <- model.matrix(
  ~ Plot_Row + leaf_tissue_c * Treatment + Genotype * Treatment + Genotype * leaf_tissue_c,
  d
)

# Voom transformation with precision weights
voomR <- voom(y_filtered_bySample, design = design, plot = FALSE)

# Save normalized expression for downstream analyses
saveRDS(voomR$E, file = file.path(paths$intermediate, "normalized_expression_logCPM_leaf_trt.rds"))
saveRDS(voomR, file = file.path(paths$intermediate, "normalized_expression_voom_object_leaf_trt.rds"))

{
  cat("Normalization factors range:",
      range(y_filtered_bySample$samples$norm.factors), "\n")
  cat("Design matrix:", nrow(design), "samples ×", ncol(design),
      "coefficients\n")
  design %>% as.data.frame() %>% tibble() %>% print()
  cat("Voom expression matrix:", nrow(voomR$E), "genes ×",
      ncol(voomR$E), "samples\n")
}
## Normalization factors range: 1 1 
## Design matrix: 43 samples × 8 coefficients
## # A tibble: 43 × 8
##    `(Intercept)` Plot_Row leaf_tissue_c `Treatment-P` GenotypeInv4m
##            <dbl>    <dbl>         <dbl>         <dbl>         <dbl>
##  1             1        4        -1.47              1             1
##  2             1        4        -0.465             1             1
##  3             1        4         0.535             1             1
##  4             1        4         1.53              1             1
##  5             1        6        -1.47              1             1
##  6             1        6         0.535             1             1
##  7             1        5        -1.47              1             0
##  8             1        5        -0.465             1             0
##  9             1        5         0.535             1             0
## 10             1        5         1.53              1             0
## # ℹ 33 more rows
## # ℹ 3 more variables: `leaf_tissue_c:Treatment-P` <dbl>,
## #   `Treatment-P:GenotypeInv4m` <dbl>, `leaf_tissue_c:GenotypeInv4m` <dbl>
## Voom expression matrix: 24249 genes × 43 samples

Linear Model Fitting

Fit linear model to voom-transformed data, then apply robust empirical Bayes moderation to stabilize variance estimates and improve power for differential expression testing.

# Estimate consensus correlation for blocks
corfit <- duplicateCorrelation(voomR, design, block = block)

# Fit linear model accounting for block correlation
fit <- lmFit(voomR, design, block = block, correlation = corfit$consensus.correlation)

# Apply empirical Bayes moderation
ebfit <- eBayes(fit)

{
  cat("Model fitted:", nrow(fit$coefficients), "genes ×",
      ncol(fit$coefficients), "coefficients\n")
  cat("\nSignificant genes per coefficient (FDR < 0.05):\n")
  print(colSums(abs(decideTests(ebfit))))
}
## Model fitted: 24249 genes × 8 coefficients
## 
## Significant genes per coefficient (FDR < 0.05):
##                 (Intercept)                    Plot_Row 
##                       19367                           0 
##               leaf_tissue_c                 Treatment-P 
##                        7888                        6171 
##               GenotypeInv4m   leaf_tissue_c:Treatment-P 
##                         778                        2570 
##   Treatment-P:GenotypeInv4m leaf_tissue_c:GenotypeInv4m 
##                           2                           1

Effect Estimation and Confidence Intervals

Extract Coefficients of Interest

We focus on biological effects while accounting for spatial covariates.

# Define predictors of interest with ORIGINAL names from model
predictors_original <- c(
  "leaf_tissue_c",
  "Treatment-P",

  "GenotypeINV4",
  "Treatment-P:GenotypeInv4m",
  "leaf_tissue_c:Treatment-P",
  "leaf_tissue_c:GenotypeInv4m"
)

topTable(ebfit) %>% colnames()
##  [1] "Plot_Row"                    "leaf_tissue_c"              
##  [3] "Treatment.P"                 "GenotypeInv4m"              
##  [5] "leaf_tissue_c.Treatment.P"   "Treatment.P.GenotypeInv4m"  
##  [7] "leaf_tissue_c.GenotypeInv4m" "AveExpr"                    
##  [9] "F"                           "P.Value"                    
## [11] "adj.P.Val"
predictors_toptable <- c(
  "leaf_tissue_c",
  "Treatment.P",
  "GenotypeINV4",
  "Treatment.P.GenotypeInv4m",
  "leaf_tissue_c.Treatment.P",
  "leaf_tissue_c.GenotypeInv4m"
)

# Define STANDARDIZED names for output
predictors_standard <- c(
  "Leaf",
  "-P",
  "Inv4m",
  "Leaf:-P",
  "Leaf:-P",
  "Leaf:Inv4m"
)

# Create mapping
predictor_map <- setNames(predictors_standard, predictors_original)

{
  cat("\nExtracting coefficients:\n")
  for (i in seq_along(predictors_original)) {
    cat("  ", predictors_original[i], "→", predictors_standard[i], "\n")
  }
}
## 
## Extracting coefficients:
##    leaf_tissue_c → Leaf 
##    Treatment-P → -P 
##    GenotypeINV4 → Inv4m 
##    Treatment-P:GenotypeInv4m → Leaf:-P 
##    leaf_tissue_c:Treatment-P → Leaf:-P 
##    leaf_tissue_c:GenotypeInv4m → Leaf:Inv4m

Calculate Effects and Confidence Intervals

For each predictor, extract results and calculate 95% confidence intervals.

#' Extract differential expression results for specified predictors
#'
#' @param ebfit An eBayes fitted model object from limma
#' @param predictor_map Named vector mapping coefficient names to display names
#'
#' @return A data frame with DE results for all specified predictors
extract_predictor_effects <- function(ebfit, predictor_map) {
  # Get coefficient names from the model
  coef_names <- colnames(coef(ebfit))

  # Match predictor names to coefficient positions
  coef_indices <- match(names(predictor_map), coef_names)

  # Validate all predictors found
  if (any(is.na(coef_indices))) {
    missing <- names(predictor_map)[is.na(coef_indices)]
    stop(
      "Coefficients not found: ", paste(missing, collapse = ", "),
      "\nAvailable: ", paste(coef_names, collapse = ", ")
    )
  }

  # Extract results for each predictor
  result_list <- lapply(seq_along(coef_indices), function(i) {
    idx <- coef_indices[i]
    tt <- topTable(ebfit, coef = idx, sort.by = "none", n = Inf)

    # Calculate 95% confidence intervals
    crit_value <- qt(0.975, ebfit$df.residual + ebfit$df.prior)
    std_errors <- ebfit$stdev.unscaled[, idx] * sqrt(ebfit$s2.post)

    tt$std_err <- std_errors
    tt$upper <- tt$logFC + crit_value * std_errors
    tt$lower <- tt$logFC - crit_value * std_errors
    tt$predictor <- predictor_map[i]
    tt$response <- rownames(tt)
    tt
  })

  # Combine and format
  effects <- do.call(rbind, result_list)
  rownames(effects) <- NULL

  effects %>%
    dplyr::select(predictor, response, everything()) %>%
    arrange(adj.P.Val)
}

# Map coefficients (use names from coef(), not topTable())
predictor_map <- c(
  "leaf_tissue_c" = "Leaf",
  "Treatment-P" = "-P",
  "GenotypeInv4m" = "Inv4m",
  "Treatment-P:GenotypeInv4m" = "Inv4m:-P",
  "leaf_tissue_c:Treatment-P" = "Leaf:-P",
  "leaf_tissue_c:GenotypeInv4m" = "Leaf:Inv4m"
)

# Verify coefficient names match
{
  cat("Available coefficients:\n")
  print(colnames(coef(ebfit)))
}
## Available coefficients:
## [1] "(Intercept)"                 "Plot_Row"                   
## [3] "leaf_tissue_c"               "Treatment-P"                
## [5] "GenotypeInv4m"               "leaf_tissue_c:Treatment-P"  
## [7] "Treatment-P:GenotypeInv4m"   "leaf_tissue_c:GenotypeInv4m"

Create Combined Effects Table

Combine all predictor results and annotate with gene information.

# Define factor level order for predictors
effect_order <- c("Leaf", "-P", "Leaf:-P", "Inv4m", "Inv4m:-P", "Leaf:Inv4m")

effects_df <- extract_predictor_effects(ebfit, predictor_map) %>%
  dplyr::rename(gene = "response") %>%
  mutate(predictor = factor(predictor, levels = effect_order)) %>%
  mutate(neglogP = -log10(adj.P.Val))

# Add DEG and regulation flags
effects_df <- effects_df %>%
  mutate(is_DEG = adj.P.Val < 0.05) %>%
  mutate(
    regulation = case_when(
      is_DEG & logFC > 0 ~ "Upregulated",
      is_DEG & logFC < 0 ~ "Downregulated",
      .default = "Unregulated"
    )
  )

{
  cat("\nTotal tests:", nrow(effects_df), "\n")
  cat("Tests per predictor:\n")
  print(table(effects_df$predictor))
  cat("\nSignificant per predictor (FDR < 0.05):\n")
  print(table(effects_df$predictor[effects_df$adj.P.Val < 0.05]))
  cat("\nCombined effects table created:\n")
  print(with(effects_df, table(predictor, regulation)))
}
## 
## Total tests: 145494 
## Tests per predictor:
## 
##       Leaf         -P    Leaf:-P      Inv4m   Inv4m:-P Leaf:Inv4m 
##      24249      24249      24249      24249      24249      24249 
## 
## Significant per predictor (FDR < 0.05):
## 
##       Leaf         -P    Leaf:-P      Inv4m   Inv4m:-P Leaf:Inv4m 
##       7888       6171       2570        778          2          1 
## 
## Combined effects table created:
##             regulation
## predictor    Downregulated Unregulated Upregulated
##   Leaf                3032       16361        4856
##   -P                  3961       18078        2210
##   Leaf:-P             1161       21679        1409
##   Inv4m                540       23471         238
##   Inv4m:-P               2       24247           0
##   Leaf:Inv4m             1       24248           0

Robust Outlier Detection

Identifies extreme differentially expressed genes using robust Mahalanobis distance based on the Minimum Covariance Determinant (MCD) method. This approach is resistant to the influence of outliers themselves, providing more reliable outlier detection than classical methods.

The MCD method estimates location and covariance using only the most central 75% of observations (alpha = 0.75), making it robust to contamination.

#' Calculate robust Mahalanobis distance for one predictor
#'
#' @param per_predictor Data frame for a single predictor
#' @param mcd_alpha Alpha parameter for MCD estimation
#'
#' @return Data frame with mahalanobis column added
calculate_robust_distance <- function(per_predictor, mcd_alpha) {
  # Extract bivariate data (logFC and -log10(FDR))
  bivariate <- per_predictor %>%
    dplyr::select(logFC, neglogP) %>%
    as.matrix()

  # Compute robust location and covariance using MCD
  mcd_result <- covMcd(bivariate, alpha = mcd_alpha)

  # Calculate robust Mahalanobis distances
  per_predictor$mahalanobis <- mahalanobis(
    x = bivariate,
    center = mcd_result$center,
    cov = mcd_result$cov
  )

  per_predictor
}

#' Add robust Mahalanobis outlier flags
#'
#' @param data Data frame with effects
#' @param distance_quantile Quantile threshold for outlier detection
#' @param FDR FDR threshold for significance
#' @param mcd_alpha Alpha parameter for MCD estimation
#'
#' @return Data frame with mahalanobis and is_mh_outlier columns
add_mahalanobis_outliers <- function(
    data = NULL,
    distance_quantile = 0.05,
    FDR = 0.05,
    mcd_alpha = 0.75
) {
  # Calculate robust Mahalanobis distance per predictor
  data <- split(data, factor(data$predictor)) %>%
    lapply(calculate_robust_distance, mcd_alpha = mcd_alpha) %>%
    bind_rows()

  # Chi-square cutoff for bivariate data (df = 2)
  cutoff <- qchisq(p = 1 - distance_quantile, df = 2)

  # Flag outliers: significant AND extreme distance
  data$is_mh_outlier <- (data$adj.P.Val < FDR) & (data$mahalanobis > cutoff)

  # Sort by distance within groups
  data %>%
    ungroup() %>%
    group_by(predictor, regulation) %>%
    arrange(-mahalanobis, .by_group = TRUE) %>%
    ungroup()
}

Calculate Mahalanobis Distances and Outlier Flags

effects_df <- add_mahalanobis_outliers(
  effects_df,
  distance_quantile = 0.05,
  FDR = 0.05
)

{
  cat("\nMahalanobis outliers detected:\n")
  print(with(effects_df, table(predictor, is_mh_outlier)))
}
## 
## Mahalanobis outliers detected:
##             is_mh_outlier
## predictor    FALSE  TRUE
##   Leaf       19461  4788
##   -P         19650  4599
##   Leaf:-P    21679  2570
##   Inv4m      23471   778
##   Inv4m:-P   24247     2
##   Leaf:Inv4m 24248     1

Gene Annotation

Load gene symbols, functional descriptions (Pannzer), and genomic coordinates (B73 v5 GFF3). Gene IDs are cleaned and coordinates imported as both GRanges (for overlap operations) and data.frame (for dplyr filtering).

# Gene symbols and locus names
gene_symbol <- read.table(
  file.path(paths$data, "gene_symbol.tab"),
  quote = "",
  header = TRUE,
  sep = "\t",
  na.strings = ""
)

# Pannzer functional annotations
pannzer <- read.table(
  file.path(paths$data, "PANNZER_DESC.tab"),
  quote = "",
  header = TRUE,
  sep = "\t"
) %>%
  group_by(gene_model) %>%
  dplyr::slice(1) %>%
  dplyr::select(gene_model, desc)

# Create unique gene symbol table
gene_symbol_unique <- gene_symbol %>%
  group_by(gene_model, locus_symbol) %>%
  dplyr::slice(1) %>%
  ungroup()

# Merge annotations
gene_pannzer <- gene_symbol_unique %>%
  left_join(pannzer, by = c("gene_model" = "gene_model")) %>%
  group_by(gene_model) %>%
  dplyr::slice(1) %>%
  ungroup() %>%
  dplyr::select(gene_model, locus_symbol, locus_name, desc)

# Genomic coordinates (GRanges + data.frame)
v5_gff_file <- file.path(paths$data, "Zea_mays.Zm-B73-REFERENCE-NAM-5.0.59.chr.gff3")
genes_gr <- rtracklayer::import(v5_gff_file) %>%
  subset(type == "gene" & seqnames %in% 1:10)
genes <- as.data.frame(genes_gr)
genes$ID <- gsub("gene:", "", genes$ID)

{
  cat("Annotations loaded:\n")
  cat("  Gene symbols:", nrow(gene_symbol), "\n")
  cat("  Pannzer descriptions:", nrow(pannzer), "\n")
  cat("  Merged annotations:", nrow(gene_pannzer), "\n")
  cat("  Genomic features:", nrow(genes), "genes across",
      length(unique(genes$seqnames)), "chromosomes\n")
}
## Annotations loaded:
##   Gene symbols: 44364 
##   Pannzer descriptions: 28308 
##   Merged annotations: 44303 
##   Genomic features: 43459 genes across 10 chromosomes

Define Genomic Regions of Interest

Define three nested regions on chromosome 4: (1) Inv4m inversion proper (gene-defined boundaries), (2) shared introgression including flanking regions (manually verified from genotyping data), and (3) flanking regions (in introgression but outside inversion).

# Inv4m inversion boundaries (defined by specific genes)
inv4m_start <- genes[genes$ID == "Zm00001eb190470", "start"]
inv4m_end <- genes[genes$ID == "Zm00001eb194800", "end"]

# Shared introgression boundaries (from RNAseq genotype verification)
introgression_start <- 157012149
introgression_end <- 195900523

# Extract gene IDs for each region
inv4m_gene_ids <- genes %>%
  filter(seqnames == 4, start >= inv4m_start, end <= inv4m_end) %>%
  pull(ID)

shared_introgression_gene_ids <- genes %>%
  filter(seqnames == 4, start >= introgression_start, end <= introgression_end) %>%
  pull(ID)

flanking_introgression_gene_ids <- shared_introgression_gene_ids[
  !(shared_introgression_gene_ids %in% inv4m_gene_ids)
]

{
  cat("Inv4m inversion: Chr4:", inv4m_start, "-", inv4m_end,
      "(", length(inv4m_gene_ids), "genes )\n")
  cat("Shared introgression: Chr4:", introgression_start, "-", introgression_end,
      "(", length(shared_introgression_gene_ids), "genes )\n")
  cat("Introgressed flanking:", length(flanking_introgression_gene_ids), "genes\n")
}
## Inv4m inversion: Chr4: 172883675 - 188132113 ( 434 genes )
## Shared introgression: Chr4: 157012149 - 195900523 ( 1099 genes )
## Introgressed flanking: 665 genes

Three-Tier DEG Classification

The goal of this section is to classify differentially expressed genes (DEGs) into three tiers of stringency based on significance (FDR) and effect size (log2FC), as well as a final set of genes selected for deep-dive analysis.

Methodology for Fold Change Thresholds

The effect sizes are interpreted differently based on the predictor:

  • Leaf Position (leaf predictor): This was modeled as a continuous slope (\(\beta\)). To be considered a large effect, the total change across the 3 units (Leaf 1 → Leaf 4) must meet the \(|\log_2FC|>3/2\) criterion. $ || = |3| > 3/2 || > 1/2 = 0.5 $ We use \(|\beta| > 0.5\) as the threshold for the leaf slope and interaction with -P.

  • Categorical Predictors (-P, Inv4m, Interaction): The standard large effect size threshold of \(|\log_2FC| > 1.5\) is applied directly.

Significant DEGs

This first tier identifies all genes that pass the statistical significance threshold, regardless of effect size.

{
  cat("\nSignificant DEGs (is_DEG, FDR < 0.05) Count:\n")
  print(with(effects_df, table(predictor, is_DEG)))
}
## 
## Significant DEGs (is_DEG, FDR < 0.05) Count:
##             is_DEG
## predictor    FALSE  TRUE
##   Leaf       16361  7888
##   -P         18078  6171
##   Leaf:-P    21679  2570
##   Inv4m      23471   778
##   Inv4m:-P   24247     2
##   Leaf:Inv4m 24248     1

High-Confidence DEGs

The second tier filters the significant DEGs further by applying the custom large effect size thresholds.

is_large_effect <- rep(FALSE, nrow(effects_df))
is_leaf <- effects_df$predictor == "Leaf"
is_interaction <- effects_df$predictor == "Leaf:-P" |
  effects_df$predictor == "Inv4m:-P" |
  effects_df$predictor == "Leaf:Inv4m"

is_large_effect[is_leaf & abs(effects_df$logFC) > 0.5] <- TRUE
is_large_effect[is_interaction & abs(effects_df$logFC) > 0.5] <- TRUE
is_large_effect[!is_leaf & abs(effects_df$logFC) > 1.5] <- TRUE

effects_df <- effects_df %>%
  mutate(is_hiconf_DEG = is_DEG & is_large_effect)

{
  cat("\nHigh-Confidence DEGs (is_hiconf_DEG) Count:\n")
  print(with(effects_df, table(predictor, is_hiconf_DEG)))
  print(
    with(
      effects_df %>% filter(is_hiconf_DEG),
      table(predictor, regulation, is_hiconf_DEG)
    ) %>%
      as_tibble() %>%
      arrange(desc(predictor), desc(regulation))
  )
}
## 
## High-Confidence DEGs (is_hiconf_DEG) Count:
##             is_hiconf_DEG
## predictor    FALSE  TRUE
##   Leaf       22818  1431
##   -P         23455   794
##   Leaf:-P    23684   565
##   Inv4m      24084   165
##   Inv4m:-P   24247     2
##   Leaf:Inv4m 24248     1
## # A tibble: 12 × 4
##    predictor  regulation    is_hiconf_DEG     n
##    <chr>      <chr>         <chr>         <int>
##  1 Leaf:Inv4m Upregulated   TRUE              0
##  2 Leaf:Inv4m Downregulated TRUE              1
##  3 Leaf:-P    Upregulated   TRUE            367
##  4 Leaf:-P    Downregulated TRUE            198
##  5 Leaf       Upregulated   TRUE            682
##  6 Leaf       Downregulated TRUE            749
##  7 Inv4m:-P   Upregulated   TRUE              0
##  8 Inv4m:-P   Downregulated TRUE              2
##  9 Inv4m      Upregulated   TRUE             79
## 10 Inv4m      Downregulated TRUE             86
## 11 -P         Upregulated   TRUE            582
## 12 -P         Downregulated TRUE            212

Selected DEGs (Rank-Based)

The final tier, is_selected_DEG, selects the most interesting genes for visualization and detailed annotation.

rank_threshold <- 10

effects_df <- effects_df %>%
  group_by(is_hiconf_DEG, predictor, regulation) %>%
  mutate(
    pval_rank = row_number(dplyr::desc(neglogP)),
    mahal_rank = row_number(dplyr::desc(mahalanobis))
  ) %>%
  ungroup() %>%
  mutate(
    is_selected_DEG = (pval_rank <= rank_threshold & is_hiconf_DEG) |
      (mahal_rank <= rank_threshold & is_hiconf_DEG)
  )

{
  cat(sprintf(
    "\nSelected DEGs (is_selected_DEG, Top N=%d by Rank) Count:\n",
    rank_threshold
  ))
  with(
    effects_df %>% filter(is_selected_DEG & regulation != "Unregulated"),
    table(regulation, predictor, is_selected_DEG)
  )
}
## 
## Selected DEGs (is_selected_DEG, Top N=10 by Rank) Count:
## , , is_selected_DEG = TRUE
## 
##                predictor
## regulation      Leaf -P Leaf:-P Inv4m Inv4m:-P Leaf:Inv4m
##   Downregulated   19 17      13    10        2          1
##   Upregulated     19 17      14    14        0          0

Annotate with Gene Information

# Join gene annotations (locus_name, desc come from gene_pannzer)
effects_df <- effects_df %>%
  left_join(
    gene_pannzer,
    by = c(gene = "gene_model"),
    relationship = "many-to-many"
  ) %>%
  mutate(desc_merged = coalesce(locus_name, desc)) %>%
  dplyr::select(predictor, regulation, gene, locus_symbol, desc_merged, everything()) %>%
  inner_join(
    genes %>%
      dplyr::select(gene = ID, CHR = seqnames, BP = start) %>%
      mutate(CHR = as.character(CHR) %>% as.integer()),
    by = "gene"
  )

# Make locus_symbol the default locus_label
# Remove symbols corresponding to DNA markers in the consensus map
effects_df <- effects_df %>%
  mutate(
    locus_label = case_when(
      is.na(locus_symbol) ~ NA_character_,
      grepl("^si\\d*[a-h]", locus_symbol) ~ NA_character_,
      grepl("^umc", locus_symbol) ~ NA_character_,
      grepl("^csu", locus_symbol) ~ NA_character_,
      grepl("^bnlg", locus_symbol) ~ NA_character_,
      grepl("^php\\d\\d", locus_symbol) ~ NA_character_,
      grepl("^csu\\d+\a", locus_symbol) ~ NA_character_,
      grepl("^pco", locus_symbol) ~ NA_character_,
      grepl("^IDP", locus_symbol) ~ NA_character_,
      grepl("^TIDP\\d{4}", locus_symbol) ~ NA_character_,
      grepl("^cl\\d*_\\d", locus_symbol) ~ NA_character_,
      grepl("^cl\\d*_-", locus_symbol) ~ NA_character_,
      grepl("^Zm00001eb", locus_symbol) ~ NA_character_,
      grepl("^Zm00001d", locus_symbol) ~ NA_character_,
      grepl("^GRM", locus_symbol) ~ NA_character_,
      grepl("LOC\\d{4}", locus_symbol) ~ NA_character_,
      TRUE ~ locus_symbol
    )
  )

{
  cat("\nAnnotations added:\n")
  cat("  Final columns:", ncol(effects_df), "\n")
  cat("  Genes with coordinates:\n")
  print(with(effects_df, table(predictor, !is.na(effects_df$CHR))))
}
## 
## Annotations added:
##   Final columns: 27 
##   Genes with coordinates:
##             
## predictor     TRUE
##   Leaf       24011
##   -P         24011
##   Leaf:-P    24011
##   Inv4m      24011
##   Inv4m:-P   24011
##   Leaf:Inv4m 24011
# Add curated locus labels
curated <- read.csv(file.path(paths$data, "selected_DEGs_curated_locus_label_2.csv")) %>%
  dplyr::select(gene, locus_label, desc_merged) %>%
  group_by(gene) %>%
  dplyr::slice(1) %>%
  ungroup()

# Coalesce curated locus_label into effects_df
effects_df <- effects_df %>%
  left_join(curated, by = "gene", suffix = c("", "_curated")) %>%
  mutate(desc_merged = ifelse(desc_merged == "", NA, desc_merged)) %>%
  mutate(
    locus_label = coalesce(locus_label_curated, locus_label),
    desc_merged = coalesce(desc_merged_curated, desc_merged)
  ) %>%
  dplyr::select(-locus_label_curated, -desc_merged_curated)

Add Region Classification

Classify genes by genomic location relative to Inv4m.

#' Count unique genes in a specified genomic region
#'
#' @param effects_df Data frame containing gene annotations and region indicators
#' @param region Character string specifying the region column name
#'
#' @return Integer count of unique genes in the specified region
count_genes_in_region <- function(effects_df, region) {
  stopifnot(
    is.data.frame(effects_df),
    is.character(region),
    region %in% names(effects_df),
    "gene" %in% names(effects_df)
  )

  effects_df %>%
    filter(.data[[region]]) %>%
    distinct(gene) %>%
    nrow()
}

effects_df <- effects_df %>%
  mutate(
    in_Inv4m = gene %in% inv4m_gene_ids,
    in_cis = gene %in% shared_introgression_gene_ids,
    in_flank = gene %in% flanking_introgression_gene_ids,
    in_trans = !in_cis
  )

regions <- c("in_Inv4m", "in_cis", "in_flank", "in_trans")

{
  cat("\nRegion classification:\n")
  print(sapply(regions, function(r) count_genes_in_region(effects_df, r)))
}
## 
## Region classification:
## in_Inv4m   in_cis in_flank in_trans 
##      253      608      355    23403

Regional Enrichment Analysis

Test whether DEGs are enriched in specific genomic regions using Fisher’s exact test.

#' Run Fisher's exact test for regional enrichment
#'
#' @param data Data frame with region and DEG indicators
#' @param region_col Column name for region indicator
#' @param deg_col Column name for DEG indicator
#'
#' @return Tibble with enrichment statistics
run_fisher <- function(data, region_col, deg_col) {
  contingency <- table(data[[region_col]], data[[deg_col]])
  fisher_result <- fisher.test(contingency)

  tibble(
    in_region_DEG = contingency[2, 2],
    in_region_total = sum(contingency[2, ]),
    outside_DEG = contingency[1, 2],
    outside_total = sum(contingency[1, ]),
    odds_ratio = fisher_result$estimate,
    p_value = fisher_result$p.value,
    enrichment = (contingency[2, 2] / sum(contingency[2, ])) /
      (contingency[1, 2] / sum(contingency[1, ]))
  )
}

DEG Enrichment by Region

# Create separate data frame for Inv4m predictor
Region_effects <- effects_df %>%
  filter(predictor == "Inv4m") %>%
  mutate(outside = in_trans)

# All DEG enrichment tests
deg_tests <- bind_rows(
  run_fisher(Region_effects, "in_cis", "is_DEG") %>%
    mutate(comparison = "Shared introgression vs Outside"),
  run_fisher(
    Region_effects %>% filter(in_flank | outside),
    "in_flank",
    "is_DEG"
  ) %>%
    mutate(comparison = "Flanking vs Outside"),
  run_fisher(
    Region_effects %>% filter(in_Inv4m | outside),
    "in_Inv4m",
    "is_DEG"
  ) %>%
    mutate(comparison = "Inv4m vs Outside"),
  run_fisher(
    Region_effects %>% filter(in_cis),
    "in_Inv4m",
    "is_DEG"
  ) %>%
    mutate(comparison = "Inv4m vs Flanking")
) %>%
  dplyr::select(comparison, everything())

{
  cat("DEG Enrichment (FDR < 0.05):\n")
  print(deg_tests)
}
## DEG Enrichment (FDR < 0.05):
## # A tibble: 4 × 8
##   comparison  in_region_DEG in_region_total outside_DEG outside_total odds_ratio
##   <chr>               <int>           <int>       <int>         <int>      <dbl>
## 1 Shared int…           301             608         475         23403      47.3 
## 2 Flanking v…           173             355         475         23403      45.8 
## 3 Inv4m vs O…           128             253         475         23403      49.4 
## 4 Inv4m vs F…           128             253         173           355       1.08
## # ℹ 2 more variables: p_value <dbl>, enrichment <dbl>

High-Confidence DEG Enrichment

hiconf_deg_tests <- bind_rows(
  run_fisher(Region_effects, "in_cis", "is_hiconf_DEG") %>%
    mutate(comparison = "Shared introgression vs Outside"),
  run_fisher(
    Region_effects %>% filter(in_flank | outside),
    "in_flank",
    "is_hiconf_DEG"
  ) %>%
    mutate(comparison = "Flanking vs Outside"),
  run_fisher(
    Region_effects %>% filter(in_Inv4m | outside),
    "in_Inv4m",
    "is_hiconf_DEG"
  ) %>%
    mutate(comparison = "Inv4m vs Outside"),
  run_fisher(
    Region_effects %>% filter(in_cis),
    "in_Inv4m",
    "is_hiconf_DEG"
  ) %>%
    mutate(comparison = "Inv4m vs Flanking")
) %>%
  dplyr::select(comparison, everything())

{
  cat("\nHigh-confidence DEG Enrichment (FDR < 0.05, |logFC| > 1.5):\n")
  print(hiconf_deg_tests)
}
## 
## High-confidence DEG Enrichment (FDR < 0.05, |logFC| > 1.5):
## # A tibble: 4 × 8
##   comparison  in_region_DEG in_region_total outside_DEG outside_total odds_ratio
##   <chr>               <int>           <int>       <int>         <int>      <dbl>
## 1 Shared int…           119             608          46         23403    123.   
## 2 Flanking v…            76             355          46         23403    138.   
## 3 Inv4m vs O…            43             253          46         23403    104.   
## 4 Inv4m vs F…            43             253          76           355      0.752
## # ℹ 2 more variables: p_value <dbl>, enrichment <dbl>

Enrichment Interpretation

{
  cat("\n=== Key Findings ===\n")
  cat("1. DEGs are enriched in Inv4m region vs genome-wide\n")
  cat("2. Both Inv4m and flanking show enrichment vs outside\n")
  cat("3. Regional enrichment pattern observed for high-confidence DEGs\n")
  cat("\nInterpretation: The introgression region (inversion + flanking)\n")
  cat("shows elevated differential expression patterns.\n")
}
## 
## === Key Findings ===
## 1. DEGs are enriched in Inv4m region vs genome-wide
## 2. Both Inv4m and flanking show enrichment vs outside
## 3. Regional enrichment pattern observed for high-confidence DEGs
## 
## Interpretation: The introgression region (inversion + flanking)
## shows elevated differential expression patterns.

Quality Control Tables

Top 10 DEGs per Predictor and Regulation

Showing top differentially expressed genes by adjusted p-value.

top_degs_qc <- effects_df %>%
  filter(is_DEG, regulation != "Unregulated") %>%
  group_by(predictor, regulation) %>%
  arrange(-neglogP, .by_group = TRUE) %>%
  dplyr::slice(1:10) %>%
  dplyr::select(
    predictor, gene, locus_symbol,
    desc_merged, logFC, neglogP
  ) %>%
  arrange(predictor, regulation, -neglogP)

{
  cat("\nTop 10 DEGs per predictor and regulation:\n")
  print(top_degs_qc)
}
## 
## Top 10 DEGs per predictor and regulation:
## # A tibble: 83 × 7
## # Groups:   predictor, regulation [10]
##    regulation    predictor gene          locus_symbol desc_merged  logFC neglogP
##    <chr>         <fct>     <chr>         <chr>        <chr>        <dbl>   <dbl>
##  1 Downregulated Leaf      Zm00001eb152… umc1690      Transcript… -1.48     7.46
##  2 Downregulated Leaf      Zm00001eb151… <NA>         NTF2 domai… -1.15     7.46
##  3 Downregulated Leaf      Zm00001eb076… <NA>         Protein ST… -0.945    7.46
##  4 Downregulated Leaf      Zm00001eb038… <NA>         Mitochondr… -0.699    7.46
##  5 Downregulated Leaf      Zm00001eb329… <NA>         Tyrosine-s… -0.632    7.46
##  6 Downregulated Leaf      Zm00001eb182… <NA>         protein MA… -0.606    7.46
##  7 Downregulated Leaf      Zm00001eb176… <NA>         photosynth… -0.521    7.46
##  8 Downregulated Leaf      Zm00001eb184… <NA>         Eukaryotic… -0.455    7.46
##  9 Downregulated Leaf      Zm00001eb391… <NA>         Short-chai… -0.604    7.37
## 10 Downregulated Leaf      Zm00001eb034… cl11031_1a   Dilute dom… -0.498    7.37
## # ℹ 73 more rows

Top Mahalanobis Outliers

Most extreme differentially expressed genes.

top_outliers_qc <- effects_df %>%
  filter(is_mh_outlier) %>%
  group_by(predictor, regulation) %>%
  arrange(-mahalanobis, .by_group = TRUE) %>%
  dplyr::slice(1:10) %>%
  dplyr::select(
    predictor, regulation, gene,
    locus_symbol, desc_merged,
    logFC, neglogP, mahalanobis
  ) %>%
  arrange(-neglogP)

{
  cat("\nTop Mahalanobis outliers per predictor and regulation:\n")
  print(top_outliers_qc)
}
## 
## Top Mahalanobis outliers per predictor and regulation:
## # A tibble: 83 × 8
## # Groups:   predictor, regulation [10]
##    predictor regulation gene  locus_symbol desc_merged logFC neglogP mahalanobis
##    <fct>     <chr>      <chr> <chr>        <chr>       <dbl>   <dbl>       <dbl>
##  1 Inv4m     Downregul… Zm00… jmj2         JUMONJI-tr… -3.55    22.3      12298.
##  2 Inv4m     Downregul… Zm00… jmj4         JUMONJI-tr… -2.88    20.0      10032.
##  3 Inv4m     Downregul… Zm00… <NA>         SUMO-activ… -1.99    18.0       8392.
##  4 Inv4m     Upregulat… Zm00… <NA>         Exocyst co…  2.82    18.0      11775.
##  5 Inv4m     Downregul… Zm00… <NA>         Methionyl-… -2.91    17.7       7693.
##  6 Inv4m     Downregul… Zm00… <NA>         CCT-alpha   -1.72    16.3       6898.
##  7 Inv4m     Downregul… Zm00… IDP4515      DnaJ-like … -1.48    16.1       6843.
##  8 Inv4m     Upregulat… Zm00… <NA>         <NA>         6.54    16.1      13441.
##  9 Inv4m     Downregul… Zm00… cl19648_2a   ADP-ribosy… -2.83    16.1       6293.
## 10 Inv4m     Upregulat… Zm00… cl19799_1    <NA>         1.95    16.1       8922.
## # ℹ 73 more rows

DEG Summary Statistics

# Overall DEG counts by predictor
overall_summary <- effects_df %>%
  group_by(predictor) %>%
  summarise(
    total_genes = n(),
    n_significant = sum(is_DEG),
    n_DEG = sum(is_DEG),
    n_hiconf_DEG = sum(is_hiconf_DEG),
    n_selected_DEG = sum(is_selected_DEG),
    pct_DEG = round(100 * n_DEG / total_genes, 2)
  )

# Region distribution for Inv4m effect
inv4m_region_summary <- effects_df %>%
  filter(predictor == "Inv4m", is_DEG) %>%
  group_by(regulation, in_Inv4m, in_cis) %>%
  summarise(n = n(), .groups = "drop")

{
  cat("\n=== DEG Summary Statistics ===\n")
  print(overall_summary)
  cat("\nInv4m DEGs by region and regulation:\n")
  print(inv4m_region_summary)
}
## 
## === DEG Summary Statistics ===
## # A tibble: 6 × 7
##   predictor  total_genes n_significant n_DEG n_hiconf_DEG n_selected_DEG pct_DEG
##   <fct>            <int>         <int> <int>        <int>          <int>   <dbl>
## 1 Leaf             24011          7870  7870         1422             37   32.8 
## 2 -P               24011          6155  6155          792             34   25.6 
## 3 Leaf:-P          24011          2550  2550          555             24   10.6 
## 4 Inv4m            24011           776   776          165             24    3.23
## 5 Inv4m:-P         24011             2     2            2              2    0.01
## 6 Leaf:Inv4m       24011             1     1            1              1    0   
## 
## Inv4m DEGs by region and regulation:
## # A tibble: 6 × 4
##   regulation    in_Inv4m in_cis     n
##   <chr>         <lgl>    <lgl>  <int>
## 1 Downregulated FALSE    FALSE    349
## 2 Downregulated FALSE    TRUE     106
## 3 Downregulated TRUE     TRUE      84
## 4 Upregulated   FALSE    FALSE    126
## 5 Upregulated   FALSE    TRUE      67
## 6 Upregulated   TRUE     TRUE      44

Selected DEGs for Manuscript

Extract selected DEGs for detailed presentation in manuscript tables.

selected_degs <- effects_df %>%
  filter(is_selected_DEG) %>%
  dplyr::select(
    predictor,
    regulation,
    gene,
    locus_symbol,
    locus_label,
    desc_merged,
    std_err,
    logFC,
    neglogP,
    mahalanobis,
    pval_rank,
    mahal_rank
  ) %>%
  arrange(predictor, regulation, -neglogP)

{
  cat("\n=== Selected DEGs for Manuscript ===\n")
  cat("Total selected DEGs:", nrow(selected_degs), "\n")
  cat("Counts by predictor and regulation:\n")
  print(with(selected_degs, table(predictor, regulation)))
}
## 
## === Selected DEGs for Manuscript ===
## Total selected DEGs: 122 
## Counts by predictor and regulation:
##             regulation
## predictor    Downregulated Upregulated
##   Leaf                  18          19
##   -P                    17          17
##   Leaf:-P               13          11
##   Inv4m                 10          14
##   Inv4m:-P               2           0
##   Leaf:Inv4m             1           0

Selected DEGs: Phosphorus Effect

Export selected DEGs specific to the phosphorus effect with pannzer description.

p_selected <- selected_degs %>%
  mutate(
    regulation = factor(
      regulation,
      levels = c("Upregulated", "Downregulated")
    )
  ) %>%
  filter(predictor == "-P") %>%
  arrange(regulation, -neglogP)

{
  cat("\nPhosphorus effect selected DEGs:\n")
  cat("  Upregulated:", sum(p_selected$regulation == "Upregulated"), "\n")
  cat("  Downregulated:", sum(p_selected$regulation == "Downregulated"), "\n")
  print(p_selected)
}
## 
## Phosphorus effect selected DEGs:
##   Upregulated: 17 
##   Downregulated: 17 
## # A tibble: 34 × 12
##    predictor regulation gene  locus_symbol locus_label desc_merged std_err logFC
##    <fct>     <fct>      <chr> <chr>        <chr>       <chr>         <dbl> <dbl>
##  1 -P        Upregulat… Zm00… pilncr1      pilncr1     pi-deficie…   0.690  7.70
##  2 -P        Upregulat… Zm00… ips1         ips1        induced by…   0.658  7.10
##  3 -P        Upregulat… Zm00… gpx1         gpx1        glyceropho…   0.625  6.84
##  4 -P        Upregulat… Zm00… pap2         pap2        purple aci…   0.411  4.64
##  5 -P        Upregulat… Zm00… pco107612    <NA>        Inorganic …   0.284  3.06
##  6 -P        Upregulat… Zm00… pfk1         pfk1        phosphofru…   0.238  2.58
##  7 -P        Upregulat… Zm00… plc6         plc6        phospholip…   0.174  1.90
##  8 -P        Upregulat… Zm00… <NA>         <NA>        FLZ-type d…   0.284  3.03
##  9 -P        Upregulat… Zm00… GRMZM2G1355… rfk1        Riboflavin…   0.375  3.98
## 10 -P        Upregulat… Zm00… <NA>         <NA>        Mannose-1-…   0.219  2.29
## # ℹ 24 more rows
## # ℹ 4 more variables: neglogP <dbl>, mahalanobis <dbl>, pval_rank <int>,
## #   mahal_rank <int>

Selected DEGs: Leaf Effect

leaf_selected <- selected_degs %>%
  mutate(
    regulation = factor(
      regulation,
      levels = c("Upregulated", "Downregulated")
    )
  ) %>%
  filter(predictor == "Leaf") %>%
  arrange(regulation, -neglogP)

{
  cat("\nLeaf stage effect selected DEGs:\n")
  cat("  Upregulated:", sum(leaf_selected$regulation == "Upregulated"), "\n")
  cat("  Downregulated:", sum(leaf_selected$regulation == "Downregulated"), "\n")
  print(leaf_selected)
}
## 
## Leaf stage effect selected DEGs:
##   Upregulated: 19 
##   Downregulated: 18 
## # A tibble: 37 × 12
##    predictor regulation gene  locus_symbol locus_label desc_merged std_err logFC
##    <fct>     <fct>      <chr> <chr>        <chr>       <chr>         <dbl> <dbl>
##  1 Leaf      Upregulat… Zm00… hir3         hir3        hypersensi…  0.0861 0.801
##  2 Leaf      Upregulat… Zm00… IDP641       <NA>        Glycosyltr…  0.119  1.09 
##  3 Leaf      Upregulat… Zm00… cyp6         cyp6        cytochrome…  0.0993 0.905
##  4 Leaf      Upregulat… Zm00… bhlh145      bhlh145     bHLH-trans…  0.101  0.889
##  5 Leaf      Upregulat… Zm00… umc2170      <NA>        DNAJ heat …  0.0770 0.645
##  6 Leaf      Upregulat… Zm00… salt1        salt1       SalT homol…  0.311  2.54 
##  7 Leaf      Upregulat… Zm00… <NA>         <NA>        <NA>         0.0953 0.735
##  8 Leaf      Upregulat… Zm00… trpp2        trpp2       trehalose-…  0.191  1.47 
##  9 Leaf      Upregulat… Zm00… wrky111      wrky111     WRKY-trans…  0.0794 0.605
## 10 Leaf      Upregulat… Zm00… <NA>         <NA>        Surfactant…  0.0694 0.528
## # ℹ 27 more rows
## # ℹ 4 more variables: neglogP <dbl>, mahalanobis <dbl>, pval_rank <int>,
## #   mahal_rank <int>

Selected DEGs: Leaf:Treatment Interaction

Export selected DEGs specific to the leaf:treatment interaction.

interaction_selected <- selected_degs %>%
  mutate(
    regulation = factor(
      regulation,
      levels = c("Upregulated", "Downregulated")
    )
  ) %>%
  group_by(regulation) %>%
  filter(predictor == "Leaf:-P") %>%
  arrange(regulation, -neglogP)

{
  cat("\nLeaf:Treatment interaction selected DEGs:\n")
  cat("  Upregulated:", sum(interaction_selected$regulation == "Upregulated"), "\n")
  cat("  Downregulated:",
      sum(interaction_selected$regulation == "Downregulated"), "\n")
  print(interaction_selected)
}
## 
## Leaf:Treatment interaction selected DEGs:
##   Upregulated: 11 
##   Downregulated: 13 
## # A tibble: 24 × 12
## # Groups:   regulation [2]
##    predictor regulation gene  locus_symbol locus_label desc_merged std_err logFC
##    <fct>     <fct>      <chr> <chr>        <chr>       <chr>         <dbl> <dbl>
##  1 Leaf:-P   Upregulat… Zm00… pco139502b   <NA>        Pyruvate k…  0.139  1.18 
##  2 Leaf:-P   Upregulat… Zm00… mrpa3        mrpa3       multidrug …  0.0792 0.643
##  3 Leaf:-P   Upregulat… Zm00… plc6         plc6        phospholip…  0.0791 0.558
##  4 Leaf:-P   Upregulat… Zm00… <NA>         <NA>        Ribonuclea…  0.0888 0.614
##  5 Leaf:-P   Upregulat… Zm00… pld16        pld16       phospholip…  0.0830 0.562
##  6 Leaf:-P   Upregulat… Zm00… <NA>         <NA>        PI-PLC X d…  0.179  1.15 
##  7 Leaf:-P   Upregulat… Zm00… gmp1         gmp1        mannose-1-…  0.110  0.687
##  8 Leaf:-P   Upregulat… Zm00… <NA>         <NA>        Heptahelic…  0.101  0.630
##  9 Leaf:-P   Upregulat… Zm00… <NA>         <NA>        Beta-galac…  0.0838 0.526
## 10 Leaf:-P   Upregulat… Zm00… pah1         pah1        phosphatid…  0.0935 0.581
## # ℹ 14 more rows
## # ℹ 4 more variables: neglogP <dbl>, mahalanobis <dbl>, pval_rank <int>,
## #   mahal_rank <int>

Plot Interactions

Spaghetti Plot

# Load normalized expression data
voomR <- readRDS(file.path(paths$intermediate, "normalized_expression_voom_object_leaf_trt.rds"))

# Define genes of interest - High_confidence interaction genes
interaction_highconf <- effects_df %>%
  filter(is_hiconf_DEG & predictor == "Leaf:-P")

interaction_genes <- interaction_highconf %>%
  pull(gene) %>%
  unique()

# Extract expression matrix for genes of interest
expr_matrix <- voomR$E[interaction_genes, ]

# Convert to long format and join with sample metadata
expr_long <- as.data.frame(t(expr_matrix)) %>%
  tibble::rownames_to_column("sample") %>%
  pivot_longer(
    cols = -sample,
    names_to = "gene",
    values_to = "log2CPM"
  ) %>%
  left_join(
    voomR$targets %>%
      tibble::rownames_to_column("sample") %>%
      dplyr::select(sample, Treatment, leaf_tissue, Genotype),
    by = "sample"
  ) %>%
  left_join(
    interaction_highconf %>%
      filter(predictor == "Leaf:-P") %>%
      dplyr::select(gene, regulation) %>%
      distinct(),
    by = "gene"
  )

# Recode treatment levels
expr_long$leafxP <- factor(expr_long$regulation)
levels(expr_long$leafxP) <- c("Negative", "Positive")

# Center expression per gene-treatment combination
expr_long <- expr_long %>%
  group_by(gene, Treatment) %>%
  mutate(centered_log2CPM = log2CPM - mean(log2CPM)) %>%
  ungroup()

# Calculate mean centered expression per gene per leaf stage per treatment
expr_summary <- expr_long %>%
  group_by(gene, Treatment, leaf_tissue, leafxP) %>%
  summarise(mean_log2CPM = mean(centered_log2CPM), .groups = "drop")

# Create separate datasets for each treatment
expr_minusP <- expr_summary %>% filter(Treatment == "-P")
expr_plusP <- expr_summary %>% filter(Treatment == "+P")

# Create spaghetti plot
base_size <- 30

# Create base plot for +P treatment (plotted first, will be in back)
p_plusP_base <- expr_plusP %>%
  ggplot(aes(x = leaf_tissue, y = mean_log2CPM, color = Treatment)) +
  geom_line(
    aes(group = interaction(gene, Treatment)),
    alpha = 0.1,
    linewidth = 0.8
  ) %>%
  with_shadow(colour = "black", x_offset = 0, y_offset = 0, sigma = 2) +
  geom_smooth(
    aes(group = Treatment),
    method = "lm",
    formula = y ~ x,
    se = FALSE,
    linewidth = 2
  ) %>%
  with_shadow(colour = "black", x_offset = 0, y_offset = 0, sigma = 2) +
  facet_wrap(~leafxP, ncol = 2) +
  labs(
    x = "Leaf",
    y = expression("Centered " * log[2] * "(CPM)")
  ) +
  theme_classic(base_size = base_size) +
  theme(
    strip.background = element_blank(),
    strip.text.x = element_blank(),
    plot.title = element_blank(),
    axis.title.y = element_text(margin = margin(r = -10)),
    plot.margin = margin(5.5, 7, 50, 5.5, "pt")
  )

# Overlay -P data on top (plotted second, will be in front)
p_spaghetti <- p_plusP_base +
  geom_line(
    data = expr_minusP,
    aes(
      x = leaf_tissue, y = mean_log2CPM, color = Treatment,
      group = interaction(gene, Treatment)
    ),
    alpha = 0.1,
    linewidth = 0.8
  ) +
  geom_smooth(
    data = expr_minusP,
    aes(
      x = leaf_tissue, y = mean_log2CPM, color = Treatment,
      group = Treatment
    ),
    method = "lm",
    formula = y ~ x,
    se = FALSE,
    linewidth = 2
  ) %>%
  with_shadow(colour = "black", x_offset = 0, y_offset = 0, sigma = 2) +
  scale_color_manual(
    values = c("+P" = "gold", "-P" = "purple4"),
    name = "Treatment"
  ) +
  guides(color = guide_legend(
    override.aes = list(linewidth = 2, alpha = 1)
  )) +
  theme(
    legend.position = c(0.4, 0.95),
    legend.title = element_blank(),
    legend.background = element_rect(fill = "transparent"),
    legend.box.background = element_rect(fill = "transparent", color = NA)
  )

p_spaghetti

# Select top annotated genes per regulation direction
interaction_for_profiles <- interaction_highconf %>%
  mutate(regulation = factor(
    regulation,
    levels = c("Downregulated", "Upregulated")
  )) %>%
  filter(predictor == "Leaf:-P") %>%
  group_by(regulation) %>%
  arrange(regulation, -neglogP) %>%
  dplyr::slice(1:12) %>%
  ungroup() %>%
  mutate(locus_label = paste0(
    locus_label, " ", gene, "\n",
    str_wrap(coalesce(locus_name, desc_merged),
      width = 30,
      whitespace_only = FALSE
    )
  )) %>%
  mutate(locus_label = gsub("pco139502b |NA ", "", locus_label))

interaction_for_profiles$locus_label <- gsub("^ +", "", interaction_for_profiles$locus_label)

interaction_for_profiles$gene
##  [1] "Zm00001eb359280" "Zm00001eb207130" "Zm00001eb389720" "Zm00001eb070520"
##  [5] "Zm00001eb212520" "Zm00001eb179680" "Zm00001eb111630" "Zm00001eb362560"
##  [9] "Zm00001eb214780" "Zm00001eb071770" "Zm00001eb004410" "Zm00001eb325410"
## [13] "Zm00001eb157810" "Zm00001eb376160" "Zm00001eb063230" "Zm00001eb144680"
## [17] "Zm00001eb339870" "Zm00001eb393060" "Zm00001eb148030" "Zm00001eb009430"
## [21] "Zm00001eb011050" "Zm00001eb289800" "Zm00001eb297970" "Zm00001eb040890"
# Filter and annotate existing expr_long
expr_plot <- expr_long %>%
  inner_join(interaction_for_profiles) %>%
  group_by(regulation) %>%
  mutate(locus_label = forcats::fct_reorder(locus_label, -neglogP))

# Define gene display order (top 2 per direction)
gene_order <- interaction_for_profiles %>%
  group_by(regulation) %>%
  arrange(regulation, -neglogP) %>%
  dplyr::slice(1:2) %>%
  pull(gene)

gene_order
## [1] "Zm00001eb359280" "Zm00001eb207130" "Zm00001eb157810" "Zm00001eb376160"

Reaction Norm Plot

p_gene <- expr_plot %>%
  filter(gene %in% gene_order) %>%
  group_by(regulation) %>%
  arrange(regulation, -neglogP) %>%
  ungroup() %>%
  ggplot(aes(
    x = leaf_tissue, y = log2CPM,
    color = Treatment,
    fill = Treatment,
    group = Treatment,
    shape = Treatment
  )) +
  stat_summary(fun = mean, geom = "line", linewidth = 1) +
  stat_summary(fun = mean, geom = "point", size = 5) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1, linewidth = 1.5) +
  scale_shape_manual(values = c(19, 25)) +
  scale_fill_manual(
    values = c("+P" = "gold", "-P" = "purple4"),
    name = "Treatment"
  ) +
  scale_color_manual(
    values = c("+P" = "gold", "-P" = "purple4"),
    name = "Treatment"
  ) +
  guides(
    color = "none",
    shape = guide_legend(
      override.aes = list(color = c("gold", "purple4"))
    )
  ) +
  facet_wrap(~locus_label, scales = "free_y", ncol = 2, dir = "v") +
  labs(
    x = "Leaf",
    y = expression(log[2] * "(CPM)")
  ) +
  theme_classic(base_size = base_size) +
  theme(
    strip.background = element_blank(),
    strip.text = element_text(size = 16, face = "bold", hjust = 0),
    plot.margin = margin(-53, 5.5, 5.5, 5.5, "pt"),
    legend.background = element_rect(fill = "transparent"),
    legend.box.background = element_rect(fill = "transparent", color = NA),
    legend.position = c(0.075, 0.65),
    legend.title = element_blank()
  )

p_gene

Plot Individual Gene Interaction

plot_genes <- interaction_for_profiles %>%
  group_by(regulation) %>%
  dplyr::select(predictor:gene, locus_label, desc_merged, neglogP) %>%
  distinct() %>%
  arrange(regulation, -neglogP) %>%
  dplyr::slice(1:10) %>%
  pull(gene)

p_interaction <- expr_plot %>%
  filter(gene %in% plot_genes) %>%
  group_by(regulation) %>%
  arrange(regulation, -neglogP) %>%
  ggplot(aes(
    x = leaf_tissue, y = log2CPM,
    color = Treatment, group = Treatment
  )) +
  stat_summary(fun = mean, geom = "line", linewidth = 1) +
  stat_summary(fun = mean, geom = "point", size = 5) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1, linewidth = 1.5) +
  scale_color_manual(
    values = c("+P" = "gold", "-P" = "purple4"),
    name = "Treatment"
  ) +
  guides(color = guide_legend(reverse = TRUE)) +
  facet_wrap(~locus_label,
    scales = "free_y",
    ncol = 5, nrow = 4
  ) +
  labs(
    x = "Leaf",
    y = expression(log[2] * "(CPM)")
  ) +
  theme_classic(base_size = base_size) +
  theme(
    strip.background = element_blank(),
    strip.text = element_text(size = 15, face = "bold", hjust = 0),
    legend.background = element_rect(fill = "transparent"),
    legend.box.background = element_rect(fill = "transparent", color = NA),
    legend.position = "top",
    legend.title = element_blank()
  )

p_interaction

Plot Right Figure Panel

# Combine plots
right_panel <- cowplot::plot_grid(
  p_spaghetti, p_gene,
  ncol = 1,
  align = "h",
  axis = "lr"
)

ggsave(
  file.path(paths$figures, "right_panel.png"),
  plot = right_panel,
  width = 8,
  height = 15,
  dpi = 150
)

right_panel

{
  cat("Spaghetti plot created with shadows\n")
  cat("Total unique genes plotted:", length(plot_genes), "\n")
}
## Spaghetti plot created with shadows
## Total unique genes plotted: 20

Export Results

Export Full Effects Table

# Full effects table with all columns
write.csv(
  effects_df,
  file = file.path(paths$intermediate, "predictor_effects_leaf_interaction_model.csv"),
  row.names = FALSE
)

{
  cat("\nFull effects table exported:\n")
  cat("  predictor_effects_leaf_interaction_model.csv\n")
}
## 
## Full effects table exported:
##   predictor_effects_leaf_interaction_model.csv

Export Selected DEGs

# Selected DEGs for manuscript
write.csv(
  selected_degs,
  file = file.path(paths$intermediate, "selected_DEGs_leaf_interaction_model.csv"),
  row.names = FALSE
)

# Phosphorus selected DEGs
write.csv(
  p_selected,
  file = file.path(paths$intermediate, "p_selected_DEGs_leaf_interaction_model.csv"),
  row.names = FALSE
)

# Leaf selected DEGs
write.csv(
  leaf_selected,
  file = file.path(paths$intermediate, "leaf_selected_DEGs_leaf_interaction_model.csv"),
  row.names = FALSE
)

# Inv4m selected DEGs
write.csv(
  selected_degs %>% filter(predictor == "Inv4m"),
  file = file.path(paths$intermediate, "inv4m_selected_DEGs_leaf_interaction_model.csv"),
  row.names = FALSE
)

# Leaf:Treatment interaction selected DEGs
write.csv(
  interaction_selected,
  file = file.path(paths$intermediate, "leafxP_selected_DEGs_leaf_interaction_model.csv"),
  row.names = FALSE
)

{
  cat("\nSelected DEG tables exported:\n")
  cat("  selected_DEGs_leaf_interaction_model.csv - All selected DEGs\n")
  cat("  p_selected_DEGs_leaf_interaction_model.csv - Phosphorus effect\n")
  cat("  leaf_selected_DEGs_leaf_interaction_model.csv - Leaf effect\n")
  cat("  inv4m_selected_DEGs_leaf_interaction_model.csv - Inv4m effect\n")
  cat("  leafxP_selected_DEGs_leaf_interaction_model.csv - Leaf:Treatment interaction\n")
}
## 
## Selected DEG tables exported:
##   selected_DEGs_leaf_interaction_model.csv - All selected DEGs
##   p_selected_DEGs_leaf_interaction_model.csv - Phosphorus effect
##   leaf_selected_DEGs_leaf_interaction_model.csv - Leaf effect
##   inv4m_selected_DEGs_leaf_interaction_model.csv - Inv4m effect
##   leafxP_selected_DEGs_leaf_interaction_model.csv - Leaf:Treatment interaction

Summary

This analysis identified differentially expressed genes across four main effects:

  • Inv4m genotype: Genes with different expression in Inv4m vs Control (|logFC| > 1.5, FDR < 0.05)
  • Leaf position gradient: Genes showing expression changes along the apical-basal axis (|logFC| > 0.5, FDR < 0.05)
  • Phosphorus deficiency: Genes responding to low P treatment (|logFC| > 1.5, FDR < 0.05)
  • Leaf × Treatment interaction: Genes where the leaf gradient differs between P treatments (|logFC| > 1.5, FDR < 0.05)

Key Findings

  • MDS analysis revealed that:
    • Dimension 1 correlates with phosphorus treatment
    • Dimension 2 correlates with leaf tissue position
    • Dimension 3 correlates with genotype
  • Spatial covariates (Plot_Row, Plot_Column) were included to account for field position effects
  • Region enrichment: DEGs show enrichment in the Inv4m region, with patterns observed across both inversion and flanking regions

Three-tier classification:

  • Significant DEGs (FDR < 0.05): 17354 genes
  • High-confidence DEGs (FDR < 0.05, |logFC| > 1.5): 2937 genes
  • Selected DEGs (top 10 by significance/Mahalanobis): 122 genes

Output Files

All results include:

  • Effect sizes (log2 fold change) with 95% confidence intervals
  • FDR-adjusted p-values
  • Gene annotations (symbols and descriptions)
  • Genomic coordinates
  • Region classification (cis/trans, within Inv4m)
  • Mahalanobis outlier flags
  • Three-tier DEG classifications

The full effects table contains 144066 gene × predictor combinations.

Session Information

sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] robustbase_0.99-6    ggtext_0.1.2         ggfx_1.0.3          
##  [4] ggpubr_0.6.2         ggplot2_4.0.1        stringr_1.6.0       
##  [7] tidyr_1.3.1          dplyr_1.1.4          rtracklayer_1.70.0  
## [10] GenomicRanges_1.62.1 Seqinfo_1.0.0        IRanges_2.44.0      
## [13] S4Vectors_0.48.0     BiocGenerics_0.56.0  generics_0.1.4      
## [16] edgeR_4.8.1          limma_3.66.0        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-9                rlang_1.1.6                
##  [3] magrittr_2.0.4              matrixStats_1.5.0          
##  [5] compiler_4.5.2              mgcv_1.9-4                 
##  [7] systemfonts_1.3.1           vctrs_0.6.5                
##  [9] pkgconfig_2.0.3             crayon_1.5.3               
## [11] fastmap_1.2.0               backports_1.5.0            
## [13] magick_2.9.0                XVector_0.50.0             
## [15] labeling_0.4.3              utf8_1.2.6                 
## [17] Rsamtools_2.26.0            rmarkdown_2.30             
## [19] markdown_2.0                ragg_1.5.0                 
## [21] purrr_1.2.0                 xfun_0.55                  
## [23] cachem_1.1.0                cigarillo_1.0.0            
## [25] litedown_0.9                jsonlite_2.0.0             
## [27] DelayedArray_0.36.0         BiocParallel_1.44.0        
## [29] broom_1.0.11                parallel_4.5.2             
## [31] R6_2.6.1                    bslib_0.9.0                
## [33] stringi_1.8.7               RColorBrewer_1.1-3         
## [35] car_3.1-3                   jquerylib_0.1.4            
## [37] Rcpp_1.1.0                  SummarizedExperiment_1.40.0
## [39] knitr_1.50                  Matrix_1.7-4               
## [41] splines_4.5.2               tidyselect_1.2.1           
## [43] abind_1.4-8                 yaml_2.3.12                
## [45] codetools_0.2-20            curl_7.0.0                 
## [47] lattice_0.22-7              tibble_3.3.0               
## [49] Biobase_2.70.0              withr_3.0.2                
## [51] S7_0.2.1                    evaluate_1.0.5             
## [53] xml2_1.5.1                  Biostrings_2.78.0          
## [55] pillar_1.11.1               MatrixGenerics_1.22.0      
## [57] carData_3.0-5               rprojroot_2.1.1            
## [59] RCurl_1.98-1.17             commonmark_2.0.0           
## [61] scales_1.4.0                glue_1.8.0                 
## [63] tools_4.5.2                 BiocIO_1.20.0              
## [65] locfit_1.5-9.12             ggsignif_0.6.4             
## [67] GenomicAlignments_1.46.0    forcats_1.0.1              
## [69] XML_3.99-0.20               cowplot_1.2.0              
## [71] grid_4.5.2                  nlme_3.1-168               
## [73] restfulr_0.0.16             Formula_1.2-5              
## [75] cli_3.6.5                   textshaping_1.0.4          
## [77] S4Arrays_1.10.1             viridisLite_0.4.2          
## [79] gtable_0.3.6                DEoptimR_1.1-4             
## [81] rstatix_0.7.3               sass_0.4.10                
## [83] digest_0.6.39               SparseArray_1.10.6         
## [85] rjson_0.2.23                farver_2.1.2               
## [87] htmltools_0.5.9             lifecycle_1.0.4            
## [89] httr_1.4.7                  here_1.0.2                 
## [91] statmod_1.5.1               gridtext_0.1.5